Efficient 2-Body Statistics Computation on GPUs: Parallelization & Beyond
Abstract
Various types of two-body statistics (2-BS) are regarded as essential components of data analysis in many scientific and computing domains. Because 2-BS computation has quadratic time complexity, modern parallel hardware has become an obvious direction for research and practice. This paper presents our recent work in designing and optimizing parallel algorithms for 2-BS computation on Graphics Processing Units (GPUs). First, we classify 2-body applications into three groups based on their data output pattern. Then, we introduce a straightforward parallel algorithm under the CUDA framework. The unique architecture of modern GPUs, however, provides abundant opportunities for optimizing the algorithm. To that end, we split the algorithm into two stages: pairwise distance function computation and writing output. We then present modifications to the basic algorithm by integrating various techniques at each stage. Since the architecture of modern GPUs is much more complex than that of multi-core CPUs, traditional wisdom on decomposing problems on a parallel platform is often insufficient for developing GPU-based algorithms. Therefore, our algorithm design focuses on effective use of hardware/software features that are unique to GPU platforms. In addition to the various programmable caches and atomic operations, we also introduce novel load balancing and register content sharing techniques. We develop models to analyze such techniques and identify the best ones for each type of 2-BS. Experiments run on modern GPU hardware show that our GPU algorithms outperform the best known CPU program by at least an order of magnitude in various applications. Furthermore, our implementation achieves a very high level of GPU resource utilization, indicating near-optimal performance. This work builds a solid foundation towards realizing our vision of a framework that can automatically generate optimized code for any new 2-BS problem.
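The two-stage split described in the abstract can be illustrated with a small CPU-side sketch. This is not the paper's CUDA implementation: the function name `pairwise_distance_histogram`, its parameters, and the choice of a distance histogram as the 2-BS are all assumptions made for illustration. Tiles stand in for thread blocks, and the plain increment in stage 2 stands in for what would be an atomic add on a GPU.

```python
import math


def pairwise_distance_histogram(points, bucket_width, num_buckets, tile_size=4):
    """Distance histogram over all unordered pairs (a hypothetical 2-BS example).

    Stage 1 evaluates the distance function for a pair;
    stage 2 writes the result into the output structure.
    """
    histogram = [0] * num_buckets
    n = len(points)
    # Tiles play the role of thread blocks: each tile is paired with itself
    # and with every later tile, so each unordered pair is visited once.
    for left_start in range(0, n, tile_size):
        left_end = min(left_start + tile_size, n)
        for right_start in range(left_start, n, tile_size):
            right_end = min(right_start + tile_size, n)
            for i in range(left_start, left_end):
                # Inside the same tile, start at i + 1 to skip self-pairs
                # and pairs that were already counted.
                first_j = i + 1 if right_start == left_start else right_start
                for j in range(first_j, right_end):
                    # Stage 1: pairwise distance function.
                    distance = math.dist(points[i], points[j])
                    # Stage 2: output. Distances beyond the last bucket are
                    # clamped into it (an assumption of this sketch).
                    bucket = min(int(distance / bucket_width), num_buckets - 1)
                    histogram[bucket] += 1
    return histogram


if __name__ == "__main__":
    sample = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 4)]
    print(pairwise_distance_histogram(sample, bucket_width=1.0, num_buckets=6))
```

The result does not depend on `tile_size`; the tiling only changes the order in which pairs are visited, which is the property a GPU version would exploit when staging tiles in on-chip memory.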
Similar resources
Efficient parallelization of the genetic algorithm solution of traveling salesman problem on multi-core and many-core systems
Efficient parallelization of genetic algorithms (GAs) on state-of-the-art multi-threading or many-threading platforms is a challenge because of the difficulty of scheduling hardware resources with respect to thread concurrency. In this paper, to resolve the problem, a novel method is proposed that parallelizes the GA by designing three concurrent kernels, each of which running some depe...
The motivation of this project is to develop an efficient parallelized simulation of large multi-particle systems, which can be used directly to problems like the sphere equidistribution and N-body problem with fairly uniform spatial distribution
The motivation of this project is to develop an efficient parallelized simulation of dynamics of large multi-particle systems on a parallel CPU/GPU high performance computing architecture. The work includes three parts: parallelization of the fast multipole method (FMM), multi-scale time stepping integrators and mapping the computation of nearby particles forces to GPUs. Possible applications o...
An approach to Improve Particle Swarm Optimization Algorithm Using CUDA
The time consumed in solving computationally heavy problems has always been a concern for computer programmers. Due to the simplicity of its implementation, PSO (Particle Swarm Optimization) is a suitable meta-heuristic algorithm for solving computationally heavy problems. However, despite the simplicity, the algorithm is inefficient for solving real computationally heavy problems but the pr...
Efficient irregular wavefront propagation algorithms on hybrid CPU-GPU machines
We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elemen...
MPI- and CUDA- implementations of modal finite difference method for P-SV wave propagation modeling
Among different discretization approaches, the Finite Difference Method (FDM) is widely used for acoustic and elastic full-waveform modeling. An inevitable drawback of the technique, however, is its severe demand on computational resources. A promising solution is parallelization, where the problem is broken into several segments, and the calculations are distributed over different processors. ...